[WIP] Moshi integration #33624

Draft · ylacombe wants to merge 9 commits into base: main
Conversation

ylacombe (Contributor)

What does this PR do?

Moshi is the latest Kyutai model. It is a streaming speech-to-speech model that can also carry an inner dialogue (i.e. it outputs text as well).

In particular, it means that Moshi deals with 3 streams of information:

  1. The user's audio
  2. Moshi's audio
  3. Moshi's textual output

Similarly to Musicgen, audio is represented with audio codebooks, which can be interpreted like tokens. The main difference between text tokens and audio codebooks is that audio codebooks introduce an additional dimension of information.
Text tokens are typically of shape (batch_size, sequence_length), but audio tokens are of shape (batch_size, num_codebooks, sequence_length).
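
As a quick illustration of the shape difference (the codebook count and vocabulary sizes below are placeholders, not Moshi's actual configuration):

```python
import torch

batch_size, num_codebooks, seq_len = 2, 8, 125  # illustrative sizes

# Text stream: one token id per time step.
text_tokens = torch.randint(0, 32_000, (batch_size, seq_len))
print(text_tokens.shape)   # torch.Size([2, 125])

# Audio stream: one id per codebook per time step.
audio_tokens = torch.randint(0, 2_048, (batch_size, num_codebooks, seq_len))
print(audio_tokens.shape)  # torch.Size([2, 8, 125])
```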


Moshi is made of 3 components:

1. The main decoder (Helium in the paper)

Here, it corresponds to MoshiForCausalLM. It is strictly a classic text LLM that uses an architecture similar to Gemma. In other words, it takes text tokens, embeds them, and passes them through the decoder and a language modeling head to get text logits.
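
A minimal sketch of that text-only path, assuming the MoshiForCausalLM class added by this PR; the checkpoint id is hypothetical and the loading details may still change while this is a WIP:

```python
import torch
from transformers import AutoTokenizer, MoshiForCausalLM  # MoshiForCausalLM is added by this PR

# "kyutai/moshiko" is a hypothetical checkpoint id, used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko")
model = MoshiForCausalLM.from_pretrained("kyutai/moshiko")

inputs = tokenizer("Hello from Moshi's inner monologue.", return_tensors="pt")
with torch.no_grad():
    text_logits = model(**inputs).logits  # (batch_size, sequence_length, text_vocab_size)
```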

2. The depth decoder

On its own, it's also a classic LLM, but this time, instead of generating over the time dimension, it generates over the codebook dimension.

This also means that its context length is num_codebooks: it can't generate more than num_codebooks tokens.

Another interesting difference from a classic LLM is that each timestep (here, each codebook) has its own set of linear layers and embeddings.
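
A toy sketch (not the PR's actual module code) of what "each codebook gets its own embeddings and linear layers" means in practice; all sizes are illustrative:

```python
import torch
import torch.nn as nn

num_codebooks, audio_vocab_size, hidden_size = 8, 2048, 1024  # illustrative sizes

# One embedding table and one output head per codebook position.
embed_per_codebook = nn.ModuleList(
    [nn.Embedding(audio_vocab_size, hidden_size) for _ in range(num_codebooks)]
)
head_per_codebook = nn.ModuleList(
    [nn.Linear(hidden_size, audio_vocab_size, bias=False) for _ in range(num_codebooks)]
)

# When the depth decoder generates the k-th codebook, it uses the k-th pair:
k = 3
token_k = torch.randint(0, audio_vocab_size, (1, 1))  # previously generated codebook token
hidden_k = embed_per_codebook[k](token_k)             # (1, 1, hidden_size)
logits_k = head_per_codebook[k](hidden_k)             # (1, 1, audio_vocab_size)
```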

3. Mimi

It's the audio encoder from Kyutai, recently integrated into transformers, which is used to "tokenize" audio. It plays the same role that Encodec plays in Musicgen.
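
For reference, "tokenizing" audio with Mimi looks roughly like this, assuming the same encode/decode API as Encodec in transformers and the kyutai/mimi checkpoint from the Mimi integration:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

# One second of dummy mono audio at Mimi's sampling rate.
raw_audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(raw_audio=raw_audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

with torch.no_grad():
    audio_codes = model.encode(inputs["input_values"]).audio_codes  # (batch, num_codebooks, frames)
    audio_values = model.decode(audio_codes).audio_values           # waveform reconstruction
```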


Architecture choices:

  1. MoshiForCausalLM corresponds to the main decoder; it can be used as a standalone text LLM.

  2. MoshiDepthDecoder is the depth decoder mentioned above.

  3. MoshiForConditionalGeneration encapsulates the main decoder, the depth decoder and the audio encoder.

Conceptually, MoshiForConditionalGeneration takes one text stream and two audio streams as input (what the user has said so far, and what the model has generated so far) and generates two streams: a text stream and an audio stream.
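
A call-signature sketch of what this could look like (the return names match the return output_text_ids, output_values snippet further down in the diff; the input argument names are hypothetical and may change):

```python
# Hypothetical sketch of the envisioned API; argument names are not final.
# text_ids:    (batch, seq_len)                 - text stream
# user_codes:  (batch, num_codebooks, seq_len)  - user audio stream (tokenized by Mimi)
# moshi_codes: (batch, num_codebooks, seq_len)  - model audio stream (tokenized by Mimi)
output_text_ids, output_values = model.generate(
    input_ids=text_ids,
    user_audio_codes=user_codes,    # hypothetical argument name
    moshi_audio_codes=moshi_codes,  # hypothetical argument name
    max_new_tokens=125,
)
# output_text_ids: generated text tokens
# output_values:   generated waveform, obtained by decoding the generated codebooks with Mimi
```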

How does it work:

-> The input streams are embedded and combined into inputs_embeds.

-> inputs_embeds is passed through the main decoder. Nothing special is done here; it's the same operation as in Gemma and similar models.

-> The main decoder outputs text logits, but also its last hidden state, referred to as the temporal context.

-> The depth decoder switches the dimension along which we generate (codebooks instead of time). It uses the token sampled from the text logits, together with the temporal context, to auto-regressively generate audio codebooks.
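
Put as pseudocode, one step of the outer generation loop looks roughly like this. This is a sketch of the flow above, not the PR's actual implementation; embed_and_combine and the last_hidden_state conditioning argument are hypothetical names:

```python
def generate_one_step(main_decoder, depth_decoder, text_ids, user_codes, moshi_codes):
    # 1. Embed the three streams and combine them into a single inputs_embeds tensor.
    #    (embed_and_combine is a hypothetical helper standing in for the real embedding logic.)
    inputs_embeds = embed_and_combine(text_ids, user_codes, moshi_codes)

    # 2. Run the main (temporal) decoder, exactly like a Gemma-style text LLM.
    outputs = main_decoder(inputs_embeds=inputs_embeds, output_hidden_states=True)
    text_logits = outputs.logits[:, -1:]                  # logits for the next text token
    temporal_context = outputs.hidden_states[-1][:, -1:]  # last hidden state of the last step

    # 3. Pick the next text token (greedy here; sampling works the same way).
    next_text_token = text_logits.argmax(dim=-1)

    # 4. The depth decoder generates num_codebooks audio tokens auto-regressively,
    #    conditioned on the new text token and the temporal context.
    next_audio_codes = depth_decoder.generate(
        input_ids=next_text_token,
        last_hidden_state=temporal_context,               # hypothetical conditioning argument
        max_new_tokens=depth_decoder.config.num_codebooks,
    )
    return next_text_token, next_audio_codes
```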

"for speech-to-speech.",
MOSHI_START_DOCSTRING,
)
class MoshiForConditionalGeneration(MoshiPreTrainedModel):
ylacombe (Contributor, Author):

cc @gante, this is the model that'll use a weird generation!

This is roughly how I envision generation. It works already, but there'll be some changes that will make the code a bit heavier.

).audio_values


return output_text_ids, output_values
ylacombe (Contributor, Author):

I actually would like to allow dynamic outputs depending on the type of generation (beam, sample, etc.). Do you think I can use a nested ModelOutput?

gante (Member):

My suggestion would be to make it as close as possible to the return structure from the original generate. Users transitioning from other models to moshi would then have as little friction as possible 🤗

gante (Member) left a comment:

the generation part makes sense to me!

Suggestion: because it is quite convoluted with nested generate calls, adding a block diagram explaining the workflow and linking it from the docstring of def generate() will likely make life easier for us (long-term maintenance) and for our users (they can quickly understand what's going on).
